AAAI.2023 - Multiagent Systems

Total: 31

#1 Synchronization and Diversity of Solutions

Authors: Emmanuel Arrighi ; Henning Fernau ; Mateus de Oliveira Oliveira ; Petra Wolf

A central computational problem in the realm of automata theory is the problem of determining whether a finite automaton A has a synchronizing word. This problem has found applications in a variety of subfields of artificial intelligence, including planning, robotics, and multi-agent systems. In this work, we study this problem within the framework of diversity of solutions, an up-and-coming trend in the field of artificial intelligence where the goal is to compute a set of solutions that are sufficiently distinct from one another. We define a notion of diversity of solutions that is suitable for contexts where solutions are strings that may have distinct lengths. Using our notion of diversity, we show that for each fixed r ∈ N, each fixed finite automaton A, and each finite automaton B given at the input, the problem of determining the existence of a diverse set {w1, w2, ..., wr} ⊆ L(B) of words that are synchronizing for A can be solved in polynomial time. Finally, we generalize this result to the realm of conformant planning, where the goal is to devise plans that achieve a goal irrespective of initial conditions and of nondeterminism that may occur during their execution.
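
As a concrete illustration of the underlying notion (not the paper's diversity algorithm), the following minimal Python sketch checks whether a given word is synchronizing for a deterministic finite automaton; the Černý automaton and the function name are chosen purely for illustration.

```python
# Minimal check of the synchronization property itself (not the paper's
# diversity algorithm): a word w is synchronizing for a DFA if applying w from
# every state leads to one and the same state.

def is_synchronizing(delta, states, word):
    """delta maps (state, symbol) -> state; word is an iterable of symbols."""
    current = set(states)
    for symbol in word:
        current = {delta[(q, symbol)] for q in current}
    return len(current) == 1

# Cerny automaton C4: 'a' rotates the states, 'b' maps state 3 to 0 and fixes the rest.
states = [0, 1, 2, 3]
delta = {(q, 'a'): (q + 1) % 4 for q in states}
delta.update({(q, 'b'): q for q in states})
delta[(3, 'b')] = 0

print(is_synchronizing(delta, states, "baaabaaab"))  # True: the classic shortest synchronizing word
print(is_synchronizing(delta, states, "ab"))         # False
```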

#2 The Multi-Agent Transportation Problem

Authors: Pascal Bachor ; Rolf-David Bergdoll ; Bernhard Nebel

We introduce the multi-agent transportation (MAT) problem, where agents have to transport containers from their starting positions to their designated goal positions. Movement takes place in a common environment where collisions between agents and between containers must be avoided. In contrast to other frameworks such as multi-agent pathfinding (MAPF) or multi-agent pickup and delivery (MAPD), the agents are allowed to separate from the containers at any time, which can reduce the makespan and also allows for plans in scenarios that are unsolvable otherwise. We present a complexity analysis establishing the problem's NP-completeness and show how the problem can be reduced to a sequence of SAT problems when optimizing for makespan. A MAT solver is empirically evaluated with regard to varying input characteristics and movement constraints and compared to a MAPD solver that utilizes conflict-based search (CBS).

#3 Emergent Quantized Communication

Authors: Boaz Carmeli ; Ron Meir ; Yonatan Belinkov

The field of emergent communication aims to understand the characteristics of communication as it emerges from artificial agents solving tasks that require information exchange. Communication with discrete messages is considered a desired characteristic, for scientific and applied reasons. However, training a multi-agent system with discrete communication is not straightforward, requiring either reinforcement learning algorithms or relaxing the discreteness requirement via a continuous approximation such as the Gumbel-softmax. Both of these solutions result in poor performance compared to fully continuous communication. In this work, we propose an alternative approach to achieve discrete communication -- quantization of the communicated messages. Using message quantization allows us to train the model end-to-end, achieving superior performance in multiple setups. Moreover, quantization is a natural framework that runs the gamut from continuous to discrete communication. Thus, it sets the ground for a broader view of multi-agent communication in the deep learning era.
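
To make the core idea concrete, here is a minimal sketch (assuming PyTorch; not the authors' implementation) of quantizing a continuous message with a straight-through gradient so the system remains trainable end-to-end; the number of levels and tensor shapes are illustrative.

```python
import torch

def quantize_message(m: torch.Tensor, levels: int = 4) -> torch.Tensor:
    """Round each entry of a message in [0, 1] to one of `levels` values.

    The straight-through trick keeps the forward pass discrete while gradients
    flow as if quantization were the identity, so the sender can be trained
    end-to-end. Illustrative sketch, not the paper's exact formulation.
    """
    q = torch.round(m * (levels - 1)) / (levels - 1)  # discrete value used forward
    return m + (q - m).detach()                       # identity gradient backward

# A sender produces a continuous message that is quantized before being passed on.
raw = torch.randn(8, requires_grad=True)
sent = quantize_message(torch.sigmoid(raw))
sent.sum().backward()       # gradients still reach the sender's parameters
print(sent)                 # entries lie on the 4-level grid {0, 1/3, 2/3, 1}
```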

#4 Learning Explicit Credit Assignment for Cooperative Multi-Agent Reinforcement Learning via Polarization Policy Gradient

Authors: Wubing Chen ; Wenbin Li ; Xiao Liu ; Shangdong Yang ; Yang Gao

Cooperative multi-agent policy gradient (MAPG) algorithms have recently attracted wide attention and are regarded as a general scheme for multi-agent systems. Credit assignment plays an important role in MAPG and can induce cooperation among multiple agents. However, most MAPG algorithms cannot achieve good credit assignment because of the game-theoretic pathology known as centralized-decentralized mismatch. To address this issue, this paper presents a novel method, Multi-Agent Polarization Policy Gradient (MAPPG). MAPPG uses a simple but efficient polarization function to transform the optimal consistency of joint and individual actions into easily realized constraints, thus enabling efficient credit assignment. Theoretically, we prove that individual policies of MAPPG can converge to the global optimum. Empirically, we evaluate MAPPG on well-known matrix and differential games, and verify that MAPPG can converge to the global optimum for both discrete and continuous action spaces. We also evaluate MAPPG on a set of StarCraft II micromanagement tasks and demonstrate that MAPPG outperforms the state-of-the-art MAPG algorithms.

#5 Zero-Shot Assistance in Sequential Decision Problems

Authors: Sebastiaan De Peuter ; Samuel Kaski

We consider the problem of creating assistants that can help agents solve new sequential decision problems, assuming the agent is not able to specify the reward function explicitly to the assistant. Instead of acting in place of the agent as in current automation-based approaches, we give the assistant an advisory role and keep the agent in the loop as the main decision maker. The difficulty is that we must account for potential biases of the agent which may cause it to reject advice in seemingly irrational ways. To do this we introduce a novel formalization of assistance that models these biases, allowing the assistant to infer and adapt to them. We then introduce a new method for planning the assistant's actions which can scale to large decision making problems. We show experimentally that our approach adapts to these agent biases, and results in higher cumulative reward for the agent than automation-based alternatives. Lastly, we show that an approach combining advice and automation outperforms advice alone at the cost of losing some safety guarantees.

#6 Multi-Unit Auctions for Allocating Chance-Constrained Resources

Authors: Anna Gautier ; Bruno Lacerda ; Nick Hawes ; Michael Wooldridge

Sharing scarce resources is a key challenge in multi-agent interaction, especially when individual agents are uncertain about their future consumption. We present a new auction mechanism for preallocating multi-unit resources among agents, while limiting the chance of resource violations. By planning for a chance constraint, we strike a balance between worst-case approaches, which under-utilise resources, and expected-case approaches, which lack formal guarantees. We also present an algorithm that allows agents to generate bids via multi-objective reasoning, which are then submitted to the auction. We then discuss how the auction can be extended to non-cooperative scenarios. Finally, we demonstrate empirically that our auction outperforms state-of-the-art techniques for chance-constrained multi-agent resource allocation in complex settings with up to hundreds of agents.
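
The chance-constraint idea can be illustrated with a small Monte Carlo check (a toy sketch, not the paper's auction mechanism): given a candidate preallocation, estimate the probability that the agents' uncertain consumption exceeds capacity and accept it only if that probability stays below a threshold delta. The agent names and consumption models below are invented for illustration.

```python
import random

def violation_probability(allocation, capacity, n_samples=10_000, seed=0):
    """Monte Carlo estimate of P(total consumption > capacity) for a candidate
    preallocation, where each agent's actual consumption is uncertain.
    `allocation` maps an agent to a sampler (a callable taking an RNG).
    Toy illustration of the chance constraint, not the auction itself."""
    rng = random.Random(seed)
    violations = sum(
        sum(sample(rng) for sample in allocation.values()) > capacity
        for _ in range(n_samples)
    )
    return violations / n_samples

# Three agents that each hold 4 units but only consume them with some probability.
allocation = {
    "a1": lambda rng: sum(rng.random() < 0.6 for _ in range(4)),
    "a2": lambda rng: sum(rng.random() < 0.5 for _ in range(4)),
    "a3": lambda rng: sum(rng.random() < 0.7 for _ in range(4)),
}
p = violation_probability(allocation, capacity=9)
print(f"estimated violation probability: {p:.3f} (accept the allocation only if <= delta)")
```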

#7 Reward-Based Negotiating Agent Strategies

Authors: Ryota Higa ; Katsuhide Fujita ; Toki Takahashi ; Takumu Shimizu ; Shinji Nakadai

This study proposes a novel reward-based negotiating agent strategy that uses a deep policy network with an issue-based representation. We compare negotiation strategies trained with reinforcement learning (RL) in tournaments against heuristics-based champion agents in multi-issue negotiation, considering a bilateral setting in which the two agents exchange offers in turn. Existing RL architectures for negotiation strategies incorporate a rich utility function that provides concrete information, even though, in practice, the rewards in RL are treated as generalized signals. Moreover, existing architectures consider neither issue-based representations of the negotiation problems nor a policy network that improves scalability across negotiation domains. Comparative studies analyze the significant properties of negotiation strategies learned with RL. The results reveal that the policy-based learning agents with issue-based representations achieve comparable or higher utility than the state-of-the-art RL and heuristic baselines, especially in large domains. Additionally, negotiation strategies with RL based on the policy network can reach agreements by using each step effectively.

#8 Intersection Coordination with Priority-Based Search for Autonomous Vehicles

Authors: Jiaoyang Li ; The Anh Hoang ; Eugene Lin ; Hai L. Vu ; Sven Koenig

The development of connected and autonomous vehicles opens an opportunity to manage intersections without signals. One promising approach is to use a central autonomous intersection manager to optimize the movement of the vehicles in the intersection. Existing work uses Mixed Integer Linear Programming (MILP) to find optimal solutions for this problem, but this approach is time-consuming and cannot be applied in real time. On the other hand, the coordination of the vehicles is essentially a Multi-Agent Path Finding (MAPF) problem, for which dozens of efficient algorithms have been proposed in recent years. Inspired by these MAPF algorithms, we propose a three-level algorithm called PSL to solve the intersection coordination problem. Theoretically, PSL is complete and polynomial-time in the number of vehicles. Empirically, PSL runs significantly faster with only a slight compromise in solution quality compared to the optimal MILP method. It also generates significantly better solutions with only a slightly larger runtime than the traditional First-Come-First-Served strategy.

#9 Solving Large-Scale Pursuit-Evasion Games Using Pre-trained Strategies

Authors: Shuxin Li ; Xinrun Wang ; Youzhi Zhang ; Wanqi Xue ; Jakub Černý ; Bo An

Pursuit-evasion games on graphs model the coordination of police forces chasing a fleeing felon in real-world urban settings, using the standard framework of imperfect-information extensive-form games (EFGs). In recent years, solving EFGs has been largely dominated by the Policy-Space Response Oracle (PSRO) methods due to their modularity, scalability, and favorable convergence properties. However, even these methods quickly reach their limits when facing large combinatorial strategy spaces of the pursuit-evasion games. To improve their efficiency, we integrate the pre-training and fine-tuning paradigm into the core module of PSRO -- the repeated computation of the best response. First, we pre-train the pursuer's policy base model against many different strategies of the evader. Then we proceed with the PSRO loop and fine-tune the pre-trained policy to attain the pursuer's best responses. The empirical evaluation shows that our approach significantly outperforms the baselines in terms of speed and scalability, and can solve even games on street maps of megalopolises with tens of thousands of crossroads -- a scale beyond the effective reach of previous methods.

#10 Contrastive Identity-Aware Learning for Multi-Agent Value Decomposition

Authors: Shunyu Liu ; Yihe Zhou ; Jie Song ; Tongya Zheng ; Kaixuan Chen ; Tongtian Zhu ; Zunlei Feng ; Mingli Song

Value Decomposition (VD) aims to deduce the contributions of agents for decentralized policies in the presence of only global rewards, and has recently emerged as a powerful credit assignment paradigm for tackling cooperative Multi-Agent Reinforcement Learning (MARL) problems. One of the main challenges in VD is to promote diverse behaviors among agents, and existing methods directly encourage the diversity of learned agent networks with various strategies. However, we argue that these dedicated designs for agent networks are still limited by the indistinguishable VD network, leading to homogeneous agent behaviors and thus degrading the cooperation capability. In this paper, we propose a novel Contrastive Identity-Aware learning (CIA) method, explicitly boosting the credit-level distinguishability of the VD network to break the bottleneck of multi-agent diversity. Specifically, our approach leverages contrastive learning to maximize the mutual information between the temporal credits and identity representations of different agents, encouraging the full expressiveness of credit assignment and, in turn, the emergence of individualities. The proposed CIA module is simple yet effective and can be readily incorporated into various VD architectures. Experiments on the SMAC benchmarks and across different VD backbones demonstrate that the proposed method yields results superior to the state-of-the-art counterparts. Our code is available at https://github.com/liushunyu/CIA.
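
A rough sketch of the contrastive ingredient, assuming PyTorch (the authors' actual objective and code live in the linked repository): an InfoNCE-style loss that treats each agent's credit embedding and its own identity embedding as a positive pair and all other agents as negatives.

```python
import torch
import torch.nn.functional as F

def infonce_identity_loss(credit_emb, identity_emb, temperature=0.1):
    """InfoNCE-style loss between per-agent temporal-credit embeddings and
    identity embeddings (both [n_agents, dim]); agent i's own identity is the
    positive, all other agents are negatives. Illustrative sketch; the actual
    CIA objective is in the authors' repository."""
    credit_emb = F.normalize(credit_emb, dim=-1)
    identity_emb = F.normalize(identity_emb, dim=-1)
    logits = credit_emb @ identity_emb.t() / temperature  # [n_agents, n_agents]
    targets = torch.arange(credit_emb.shape[0])           # agent i <-> identity i
    return F.cross_entropy(logits, targets)

# Example with 5 agents and 32-dimensional embeddings produced by the VD network.
loss = infonce_identity_loss(torch.randn(5, 32, requires_grad=True),
                             torch.randn(5, 32, requires_grad=True))
loss.backward()   # added to the usual value-decomposition TD loss during training
```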

#11 Learning to Shape Rewards Using a Game of Two Partners

Authors: David Mguni ; Taher Jafferjee ; Jianhong Wang ; Nicolas Perez-Nieves ; Wenbin Song ; Feifei Tong ; Matthew Taylor ; Tianpei Yang ; Zipeng Dai ; Hui Chen ; Jiangcheng Zhu ; Kun Shao ; Jun Wang ; Yaodong Yang

Reward shaping (RS) is a powerful method in reinforcement learning (RL) for overcoming the problem of sparse or uninformative rewards. However, RS typically relies on manually engineered shaping-reward functions whose construction is time-consuming and error-prone. It also requires domain knowledge, which runs contrary to the goal of autonomous learning. We introduce Reinforcement Learning Optimising Shaping Algorithm (ROSA), an automated reward shaping framework in which the shaping-reward function is constructed in a Markov game between two agents. A reward-shaping agent (Shaper) uses switching controls to determine which states to add shaping rewards for more efficient learning, while the other agent (Controller) learns the optimal policy for the task using these shaped rewards. We prove that ROSA, which adopts existing RL algorithms, learns to construct a shaping-reward function that is beneficial to the task, thus ensuring efficient convergence to high-performance policies. We demonstrate ROSA's properties in three didactic experiments and show its superior performance against state-of-the-art RS algorithms in challenging sparse reward environments.

#12 Reconstructing an Epidemic Outbreak Using Steiner Connectivity

Authors: Ritwick Mishra ; Jack Heavey ; Gursharn Kaur ; Abhijin Adiga ; Anil Vullikanti

Only a subset of infections is actually observed in an outbreak, due to multiple reasons such as asymptomatic cases and under-reporting. Therefore, reconstructing an epidemic cascade given some observed cases is an important step in responding to such an outbreak. A maximum likelihood solution to this problem (referred to as CascadeMLE) can be shown to be a variation of the classical Steiner subgraph problem, which connects a subset of observed infections. In contrast to prior works on epidemic reconstruction, which consider the standard Steiner tree objective, we show that a solution to CascadeMLE, based on the actual MLE objective, has a very different structure. We design a logarithmic approximation algorithm for CascadeMLE, and evaluate it on multiple synthetic and social contact networks, including a contact network constructed for a hospital. Our algorithm has significantly better performance compared to a prior baseline.

#13 Formal Verification of Bayesian Mechanisms

Authors: Munyque Mittelmann ; Bastien Maubert ; Aniello Murano ; Laurent Perrussel

In this paper, for the first time, we study the formal verification of Bayesian mechanisms through strategic reasoning. We rely on the framework of Probabilistic Strategy Logic (PSL), which is well-suited for representing and verifying multi-agent systems with incomplete information. We take advantage of recent results on the decidability of PSL model checking under memoryless strategies, and reduce the problem of formally verifying Bayesian mechanisms to PSL model checking. We show how to encode Bayesian-Nash equilibrium and economic properties, and illustrate our approach with different kinds of mechanisms.

#14 Memory-Augmented Theory of Mind Network

Authors: Dung Nguyen ; Phuoc Nguyen ; Hung Le ; Kien Do ; Svetha Venkatesh ; Truyen Tran

Social reasoning necessitates the capacity of theory of mind (ToM), the ability to contextualise and attribute mental states to others without having access to their internal cognitive structure. Recent machine learning approaches to ToM have demonstrated that we can train the observer to read the past and present behaviours of other agents and infer their beliefs (including false beliefs about things that no longer exist), goals, intentions and future actions. The challenges arise when the behavioural space is complex, demanding skilful navigation of rapidly changing contexts over an extended period. We tackle these challenges by equipping the observer with novel neural memory mechanisms to encode information about others and hierarchical attention to selectively retrieve it. The memories allow rapid, selective querying of distal related past behaviours of others to deliberatively reason about their current mental state, beliefs and future behaviours. This results in ToMMY, a theory of mind model that learns to reason while making few assumptions about the underlying mental processes. We also construct a new suite of experiments to demonstrate that the memories facilitate the learning process and achieve better theory-of-mind performance, especially for high-demand false-belief tasks that require inferring through multiple steps of changes.

#15 Socially Optimal Non-discriminatory Restrictions for Continuous-Action Games

Authors: Michael Oesterle ; Guni Sharon

We address the following mechanism design problem: Given a multi-player Normal-Form Game (NFG) with a continuous action space, find a non-discriminatory (i.e., identical for all players) restriction of the action space which maximizes a fixed social utility function at the resulting Nash Equilibrium. First, we propose a formal model of a Restricted Game and the corresponding restriction optimization problem. We then present an algorithm to find optimal non-discriminatory restrictions under some assumptions. Our experimental results with Braess' Paradox and the Cournot Game show that this method leads to an optimized social utility of the Nash Equilibria, even when the assumptions are not guaranteed to hold. Finally, we outline a generalization of our approach to the much wider scope of Stochastic Games.
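
The following toy example (hand-rolled parameters and grid search, not the paper's algorithm) illustrates the effect on a symmetric Cournot duopoly: for each candidate non-discriminatory cap on quantities, compute the Nash equilibrium by best-response iteration and score it with total profit as the social utility; a cap near the half-monopoly quantity beats the unrestricted equilibrium.

```python
# Toy restriction optimization for a symmetric Cournot duopoly with inverse
# demand P = a - b*(q1 + q2) and unit cost c. Both players face the same cap
# (the non-discriminatory restriction); total profit serves as the social
# utility. Hand-rolled example, not the paper's algorithm.

a, b, c = 10.0, 1.0, 1.0

def best_response(q_other, cap):
    return min(cap, max(0.0, (a - c - b * q_other) / (2 * b)))

def equilibrium(cap, iters=200):
    q1 = q2 = cap / 2
    for _ in range(iters):                       # best-response iteration (a contraction)
        q1, q2 = best_response(q2, cap), best_response(q1, cap)
    return q1, q2

def social_utility(q1, q2):
    price = a - b * (q1 + q2)
    return (price - c) * (q1 + q2)               # total producer profit

best = max((social_utility(*equilibrium(cap)), cap) for cap in [0.25 * k for k in range(1, 25)])
print(best)   # best cap is near (a - c) / (4b) = 2.25 and beats the unrestricted outcome (18.0)
```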

#16 Fault-Tolerant Offline Multi-Agent Path Planning

Authors: Keisuke Okumura ; Sébastien Tixeuil

We study a novel graph path planning problem for multiple agents that may crash at runtime and thereby block part of the workspace. In our setting, agents can detect neighboring crashed agents and change their followed paths at runtime. The objective is then to prepare a set of paths and switching rules for each agent, ensuring that all correct agents reach their destinations without collisions or deadlocks, despite unforeseen crashes of other agents. Such planning is attractive for building reliable multi-robot systems. We present the problem formalization, a theoretical analysis including computational complexity results, and an approach to solving this offline planning problem.

#17 LaCAM: Search-Based Algorithm for Quick Multi-Agent Pathfinding

Author: Keisuke Okumura

We propose a novel complete algorithm for multi-agent pathfinding (MAPF) called lazy constraints addition search for MAPF (LaCAM). MAPF is a problem of finding collision-free paths for multiple agents on graphs and is the foundation of multi-robot coordination. LaCAM uses a two-level search to find solutions quickly, even with hundreds of agents or more. At the low level, it searches constraints on agents' locations. At the high level, it searches a sequence of all agents' locations, following the constraints specified by the low level. Our exhaustive experiments reveal that LaCAM is comparable to or outperforms state-of-the-art sub-optimal MAPF algorithms in a variety of scenarios, in terms of success rate, planning time, and sum-of-costs solution quality.

#18 Networked Anti-coordination Games Meet Graphical Dynamical Systems: Equilibria and Convergence

Authors: Zirou Qiu ; Chen Chen ; Madhav V. Marathe ; S. S. Ravi ; Daniel J. Rosenkrantz ; Richard E. Stearns ; Anil Vullikanti

Evolutionary anti-coordination games on networks capture real-world strategic situations such as traffic routing and market competition. Two key problems concerning evolutionary games are the existence of a pure Nash equilibrium (NE) and the convergence time. In this work, we study these two problems for anti-coordination games under sequential and synchronous update schemes. For each update scheme, we examine two decision modes based on whether an agent considers its own previous action (self essential) or not (self non-essential) in choosing its next action. Using a relationship between games and dynamical systems, we show that for both update schemes, finding an NE can be done efficiently under the self non-essential mode but is computationally intractable under the self essential mode. We then identify special cases for which an NE can be obtained efficiently. For convergence time, we show that the dynamics converges in a polynomial number of steps under the synchronous scheme; for the sequential scheme, the convergence time is polynomial only under the self non-essential mode. Through experiments, we empirically examine the convergence time and the equilibria for both synthetic and real-world networks.
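
For intuition, here is a toy simulation of the synchronous scheme in the self non-essential mode (assuming networkx is available; the payoff model and tie-breaking are simplified and not taken from the paper): each agent simultaneously switches to the action held by the minority of its neighbors, and we count synchronous steps until the configuration repeats.

```python
import random
import networkx as nx

def synchronous_anticoordination(G, max_steps=1000, seed=0):
    """Synchronous dynamics in the self non-essential mode: every agent
    simultaneously adopts the action played by the minority of its neighbors
    (ties broken toward action 0). Returns the step at which the global
    configuration first repeats (a fixed point or a short cycle)."""
    rng = random.Random(seed)
    state = {v: rng.randint(0, 1) for v in G}

    def minority_action(v):
        ones = sum(state[u] for u in G[v])
        return 1 if ones < G.degree(v) - ones else 0

    seen = {tuple(state[v] for v in G)}
    for step in range(1, max_steps + 1):
        state = {v: minority_action(v) for v in G}   # all agents update at once
        key = tuple(state[v] for v in G)
        if key in seen:
            return step
        seen.add(key)
    return None

G = nx.erdos_renyi_graph(n=50, p=0.1, seed=1)
print(synchronous_anticoordination(G))   # typically a small number of synchronous steps
```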

#19 Learning from Good Trajectories in Offline Multi-Agent Reinforcement Learning

Authors: Qi Tian ; Kun Kuang ; Furui Liu ; Baoxiang Wang

Offline multi-agent reinforcement learning (MARL) aims to learn effective multi-agent policies from pre-collected datasets, which is an important step toward the deployment of multi-agent systems in real-world applications. However, in practice, the individual behavior policies that generate the multi-agent joint trajectories usually perform at different levels, e.g., one agent may follow a random policy while the others follow medium-quality policies. In a cooperative game with a global reward, an agent trained by existing offline MARL methods often inherits this random policy, jeopardizing the utility of the entire team. In this paper, we investigate offline MARL with explicit consideration of the diversity of agent-wise trajectories and propose a novel framework called Shared Individual Trajectories (SIT) to address this problem. Specifically, an attention-based reward decomposition network assigns the credit to each agent through a differentiable key-value memory mechanism in an offline manner. These decomposed credits are then used to reconstruct the joint offline datasets into prioritized experience replay with individual trajectories, after which agents can share their good trajectories and conservatively train their policies with a critic based on a graph attention network (GAT). We evaluate our method in both discrete control (i.e., StarCraft II and the multi-agent particle environment) and continuous control (i.e., Multi-Agent MuJoCo). The results indicate that our method achieves significantly better results on complex and mixed offline multi-agent datasets, especially when the difference in data quality between individual trajectories is large.

#20 Resource Sharing through Multi-Round Matchings

Authors: Yohai Trabelsi ; Abhijin Adiga ; Sarit Kraus ; S. S. Ravi ; Daniel J. Rosenkrantz

Applications such as employees sharing office spaces over a workweek can be modeled as problems where agents are matched to resources over multiple rounds. Agents' requirements limit the set of compatible resources and the rounds in which they want to be matched. Viewing such an application as a multi-round matching problem on a bipartite compatibility graph between agents and resources, we show that a solution (i.e., a set of matchings, with one matching per round) can be found efficiently if one exists. To cope with situations where a solution does not exist, we consider two extensions. In the first extension, a benefit function is defined for each agent and the objective is to find a multi-round matching to maximize the total benefit. For a general class of benefit functions satisfying certain properties (including diminishing returns), we show that this multi-round matching problem is efficiently solvable. This class includes utilitarian and Rawlsian welfare functions. For another benefit function, we show that the maximization problem is NP-hard. In the second extension, the objective is to generate advice to each agent (i.e., a subset of requirements to be relaxed) subject to a budget constraint so that the agent can be matched. We show that this budget-constrained advice generation problem is NP-hard. For this problem, we develop an integer linear programming formulation as well as a heuristic based on local search. We experimentally evaluate our algorithms on synthetic networks and apply them to two real-world situations: shared office spaces and matching courses to classrooms.
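
Because rounds are independent in the basic setting, the feasibility question can be sketched as one maximum bipartite matching per round (assuming networkx; the benefit-maximization and advice-generation extensions are not covered by this toy example).

```python
import networkx as nx

def multi_round_matching(agents, resources, compatible, requests, rounds):
    """Return one matching per round, or None if some round cannot serve every
    requesting agent. `compatible[a]` is the set of resources agent a may use,
    `requests[a]` the set of rounds in which a wants to be matched. Rounds are
    independent, so a maximum matching per round settles basic feasibility."""
    solution = {}
    for r in rounds:
        active = [a for a in agents if r in requests[a]]
        G = nx.Graph()
        G.add_nodes_from(active, bipartite=0)
        G.add_nodes_from(resources, bipartite=1)
        G.add_edges_from((a, res) for a in active for res in compatible[a])
        matching = nx.bipartite.maximum_matching(G, top_nodes=active)
        if any(a not in matching for a in active):
            return None                          # some requesting agent stays unmatched
        solution[r] = {a: matching[a] for a in active}
    return solution

# Two agents sharing two desks over two days.
print(multi_round_matching(
    agents=["ann", "bob"], resources=["desk1", "desk2"],
    compatible={"ann": {"desk1"}, "bob": {"desk1", "desk2"}},
    requests={"ann": {1, 2}, "bob": {2}}, rounds=[1, 2]))
```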

#21 Effective Integration of Weighted Cost-to-Go and Conflict Heuristic within Suboptimal CBS

Authors: Rishi Veerapaneni ; Tushar Kusnur ; Maxim Likhachev

Conflict-Based Search (CBS) is a popular multi-agent path finding (MAPF) solver that employs a low-level single-agent planner and a high-level constraint tree to resolve conflicts. The vast majority of modern MAPF solvers focus on improving CBS by reducing the size of this tree through various strategies, with few methods modifying the low-level planner. Typically, low-level planners in existing CBS methods use an unweighted cost-to-go heuristic, with suboptimal CBS methods also using a conflict heuristic to help the high-level search. In this paper, we show that, contrary to prevailing CBS beliefs, a weighted cost-to-go heuristic can be used effectively alongside the conflict heuristic in two possible variants. In particular, one of these variants obtains large speedups, 2-100x, across several scenarios and suboptimal CBS methods. Importantly, we discover that performance is related not to the weighted cost-to-go heuristic itself but rather to the relative conflict heuristic weight's ability to effectively balance low-level and high-level work. Additionally, to the best of our knowledge, we show the first theoretical relation between prioritized planning and bounded suboptimal CBS and demonstrate that our methods are their natural generalization.
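
As a rough sketch of the low-level ingredient (illustrative only; the paper evaluates two specific variants whose exact combinations differ), the low-level open list can order nodes by the conflict heuristic and break ties with an inflated cost-to-go f = g + w * h.

```python
import heapq

# Illustrative open-list ordering for a suboptimal CBS low-level search: nodes
# are ordered primarily by the conflict heuristic (conflicts with the other
# agents' current paths) and ties are broken by an inflated cost-to-go,
# f = g + w * h. The exact combinations studied in the paper differ.

W = 2.0   # cost-to-go inflation factor (per-variant choice; the value here is arbitrary)

def push(open_list, node):
    key = (node["conflicts"], node["g"] + W * node["h"])
    heapq.heappush(open_list, (key, node["id"], node))

def pop(open_list):
    return heapq.heappop(open_list)[-1]

open_list = []
push(open_list, {"id": 0, "g": 4, "h": 6, "conflicts": 1})
push(open_list, {"id": 1, "g": 5, "h": 5, "conflicts": 0})
print(pop(open_list)["id"])   # 1: fewer conflicts wins even though the f-values are similar
```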

#22 DM²: Decentralized Multi-Agent Reinforcement Learning via Distribution Matching

Authors: Caroline Wang ; Ishan Durugkar ; Elad Liebman ; Peter Stone

Current approaches to multi-agent cooperation rely heavily on centralized mechanisms or explicit communication protocols to ensure convergence. This paper studies the problem of distributed multi-agent learning without resorting to centralized components or explicit communication. It examines the use of distribution matching to facilitate the coordination of independent agents. In the proposed scheme, each agent independently minimizes the distribution mismatch to the corresponding component of a target visitation distribution. The theoretical analysis shows that under certain conditions, each agent minimizing its individual distribution mismatch leads to convergence to the joint policy that generated the target distribution. Further, if the target distribution comes from a joint policy that optimizes a cooperative task, the optimal policy for a combination of this task reward and the distribution matching reward is the same joint policy. This insight is used to formulate a practical algorithm (DM^2), in which each individual agent matches a target distribution derived from concurrently sampled trajectories from a joint expert policy. Experimental validation on the StarCraft domain shows that combining (1) a task reward and (2) a distribution matching reward toward expert demonstrations of the same task allows agents to outperform a naive distributed baseline. Additional experiments probe the conditions under which expert demonstrations need to be sampled to obtain the learning benefits.
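
A heavily simplified sketch of the reward combination (illustrative only; DM^2 derives the matching reward from concurrently sampled expert trajectories rather than from raw visitation counts, and the coefficient below is invented): each agent's training reward adds a distribution-matching bonus to the task reward.

```python
import math
from collections import Counter

def matching_bonus(expert_visits, agent_visits, key):
    """Crude log-ratio bonus pushing the agent's state-action visitation toward
    the expert's (with add-one smoothing). A simplified stand-in for the
    distribution-matching reward; DM^2 estimates it differently in practice."""
    return math.log((expert_visits[key] + 1) / (agent_visits[key] + 1))

def shaped_reward(task_reward, expert_visits, agent_visits, key, alpha=0.5):
    """Per-agent training reward: task reward plus a weighted matching term."""
    return task_reward + alpha * matching_bonus(expert_visits, agent_visits, key)

expert = Counter({("s0", "a1"): 30, ("s1", "a0"): 10})
agent = Counter({("s0", "a1"): 5, ("s1", "a2"): 20})
print(shaped_reward(1.0, expert, agent, ("s0", "a1")))   # bonus for expert-like behaviour
```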

#23 Emergence of Punishment in Social Dilemma with Environmental Feedback

Authors: Zhen Wang ; Zhao Song ; Chen Shen ; Shuyue Hu

Altruistic punishment (or simply punishment) has been extensively shown to be an important mechanism for promoting cooperation in human societies. In AI, the emergence of punishment has received much recent interest. In this paper, we contribute a novel evolutionary game-theoretic model to study the impacts of environmental feedback. While a population of agents plays public goods games, there exists a third-party population whose payoffs depend not only on whether to punish or not, but also on the state of the environment (e.g., how cooperative the agents in the social dilemma are). Focusing on one-shot public goods games, we show that environmental feedback, by itself, can lead to the emergence of punishment. We analyze the co-evolution of punishment and cooperation, and derive conditions for their co-presence, co-dominance and co-extinction. Moreover, we show that the system can exhibit bistability as well as cyclic dynamics. Our findings provide a new explanation for the emergence of punishment. On the other hand, our results also highlight the need to carefully design how punishment is implemented in multi-agent systems, as the resulting evolutionary dynamics can be somewhat complex.
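
A toy coupled replicator-dynamics simulation (with made-up payoffs, not the paper's model) conveys the flavor of the environmental feedback: defectors are fined in proportion to the punisher fraction, while the punishers' net payoff depends on the current cooperation level minus the cost of punishing defectors.

```python
# Toy coupled replicator dynamics with made-up payoffs (not the paper's model):
# x is the fraction of cooperators among the agents, y the fraction of punishers
# in the third-party population. Defectors are fined in proportion to y, while a
# punisher's net payoff grows with the environmental state x (how cooperative
# the agents currently are) and shrinks with the cost of punishing defectors.

c, fine, beta, gamma = 1.0, 3.0, 1.5, 1.0   # illustrative constants
x, y, dt = 0.3, 0.4, 0.01

for _ in range(20_000):
    dx = x * (1 - x) * (fine * y - c)                 # cooperate vs. defect
    dy = y * (1 - y) * (beta * x - gamma * (1 - x))   # punish vs. not punish
    x = min(1.0, max(0.0, x + dt * dx))
    y = min(1.0, max(0.0, y + dt * dy))

print(f"long-run state: cooperation {x:.2f}, punishment {y:.2f}")
```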

#24 Subspace-Aware Exploration for Sparse-Reward Multi-Agent Tasks

Authors: Pei Xu ; Junge Zhang ; Qiyue Yin ; Chao Yu ; Yaodong Yang ; Kaiqi Huang

Exploration under sparse rewards is a key challenge for multi-agent reinforcement learning problems. One possible solution to this issue is to exploit inherent task structures to accelerate exploration. In this paper, we present a novel exploration approach, which encodes a special structural prior on the reward function into exploration, for sparse-reward multi-agent tasks. Specifically, a novel entropic exploration objective which encodes the structural prior is proposed to accelerate the discovery of rewards. By maximizing the lower bound of this objective, we then propose an algorithm with moderate computational cost, which can be applied to practical tasks. Under the sparse-reward setting, we show that the proposed algorithm significantly outperforms the state-of-the-art algorithms in the multiple-particle environment, the Google Research Football and StarCraft II micromanagement tasks. To the best of our knowledge, on some hard tasks (such as 27m_vs_30m), which involve a relatively large number of agents and need non-trivial strategies to defeat enemies, our method is the first to learn winning strategies under the sparse-reward setting.

#25 Consensus Learning for Cooperative Multi-Agent Reinforcement Learning

Authors: Zhiwei Xu ; Bin Zhang ; Dapeng Li ; Zeren Zhang ; Guangchong Zhou ; Hao Chen ; Guoliang Fan

Almost all multi-agent reinforcement learning algorithms without communication follow the principle of centralized training with decentralized execution. During centralized training, agents can be guided by the same signals, such as the global state. During execution, however, agents lack the shared signal and choose actions given only local observations. Inspired by viewpoint invariance and contrastive learning, we propose consensus learning for cooperative multi-agent reinforcement learning in this study. Although based on local observations, different agents can infer the same consensus in discrete spaces without communication. We feed the inferred one-hot consensus to the network of agents as an explicit input in a decentralized way, thereby fostering their cooperative spirit. With minor model modifications, our suggested framework can be extended to a variety of multi-agent reinforcement learning algorithms. Moreover, we evaluate these variants on several fully cooperative tasks and obtain convincing results.